Feat/tool sequences by tianmu-li · Pull Request #285 · mlcommons/endpoints

tianmu-li · 2026-04-17T15:11:48Z

What does this PR do?

Updated multi-turn implementation for #232. Added tool sequencing, fixed scheduler for concurrent requests.

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-04-17T15:11:59Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces a comprehensive multi-turn conversation benchmarking framework, including a new MultiTurnScheduler, ConversationManager, and MultiTurnDataset. These additions enable benchmarking of conversational AI workloads with turn sequencing, conversation history management, and optional concurrency control. My review identified potential issues regarding the usage of sentinel objects in the scheduler and the robustness of the timeout logic in the conversation manager.

arekay-nv

Partial review - will complete later.

- Remove sequential conversation mode (redundant with target_concurrency=1) - Remove `enabled` field from MultiTurnConfig; presence of multi_turn: block implies enabled - Add conversation grouping validation to MultiTurnDataset (raises InputValidationError if rows for a conversation_id are not consecutive) - Update YAML example configs: model placeholder, relative dataset paths, removed redundant metrics.collect - Move MULTI_TURN_QUICKSTART.md to docs/ - Update all documentation to remove sequential mode references Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…wer comments - Remove dead constant BLOCK_ON_PREVIOUS_TURN = -1 from scheduler.py - Remove redundant outer with state.condition: in mark_turn_complete - Remove ConversationMode import and explicit mode= args from integration tests - Fix format: jsonl → format: ".jsonl" in example YAMLs and docs - Add target_concurrency: 1 clarification to quickstart (preserves turn ordering) - Remove broken HYBRID_SCHEDULER_GUIDE.md reference from quickstart Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

arekay-nv · 2026-04-27T21:50:06Z

⚠️ Superseded. This comment was posted while a pending review was blocking inline-comment delivery. The review has since been re-posted with all 14 inline comments attached: see review #4184417968 and the updated summary. The findings are unchanged.

arekay-nv

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex review failed with a CLI config error — invalid 'features' requirement 'browser_use' from cloud requirements) | Depth: thorough

Found 15 issues (0 critical, 2 high, 5 medium, 8 low). 14 posted as inline comments, 1 in summary table only (line outside diff hunk).

arekay-nv · 2026-04-27T21:54:16Z

Review Council — Multi-AI Code Review

Reviewed by: Claude | Depth: thorough | Commit: 0a7ad37

14 inline comments posted in review #4184417968. 1 finding (openai_adapter.py:131) is summary-only because the line falls outside the diff hunk.

Found 15 issues (0 critical, 2 high, 5 medium, 8 low). Codex review failed with a CLI config error (invalid features requirement browser_use from cloud requirements) — Claude-only review.

🟠 Must Fix (high)

Issues that will cause incorrect behavior users will hit in normal usage.

#	Location	Category	Summary
1	`src/inference_endpoint/openai/openai_msgspec_adapter.py:222`	bug	`from_endpoint_response` violates the `QueryResult.metadata` type contract by passing `metadata=None` when no…
2	`src/inference_endpoint/load_generator/multi_turn_strategy.py:137`	error-handling	`await asyncio.gather(*tasks, return_exceptions=True)` silently swallows every exception raised by a `_conv_pipeline` coroutine. The result…

🟡 Should Fix (medium)

Real issues that trigger under specific conditions, or design flaws that will compound.

#	Location	Category	Summary
3	`src/inference_endpoint/load_generator/multi_turn_strategy.py:164`	bug	When a turn times out waiting for the previous turn's response, the pipeline does `state.failed_turns += 1` but never increments…
4	`src/inference_endpoint/load_generator/session.py:206`	data-integrity	`PromptData.text = data.get("prompt")` and `token_ids = data.get("input_tokens") or data.get("token_ids")` — neither is set on multi-turn…
5	`src/inference_endpoint/dataset_manager/multi_turn_dataset.py:212`	bug	`system_prompts_by_conv` is keyed by `str(conv_id)` (line 212), but `pre_built_messages_by_key` and `current_turn_messages_by_key` are…
6	`src/inference_endpoint/load_generator/multi_turn_strategy.py:180`	bug	In live-history mode (`use_dataset_history=False`), the per-turn `tool` message reuses the dataset's hardcoded `tool_call_id` (e.g.…
7	`src/inference_endpoint/load_generator/conversation_manager.py:142`	error-handling	`mark_turn_complete` and `mark_turn_failed` raise `KeyError` if `conversation_id` is missing. These are invoked from…

🔵 Consider (low)

Valid improvements that could be follow-ups.

#	Location	Category	Summary
8	`src/inference_endpoint/config/runtime_settings.py:200`	design	`self.load_pattern.type.value == "multi_turn"` compares the enum value as a string literal instead of the enum:…
9	`src/inference_endpoint/config/schema.py:253`	design	`MultiTurnConfig` uses `model_config = {"extra": "forbid"}` (raw dict) and is missing `frozen=True`, while every other config model in this…
10	`src/inference_endpoint/dataset_manager/multi_turn_dataset.py:222`	bug	When iterating `prior_rows` to build pre-built messages, a tool row whose `tool_results` field is an empty list (e.g. `tool_results: []`)…
11	`tests/integration/test_multi_turn.py:559`	testing	The "tool use" integration tests verify that `tools`, `tool_calls`, and `tool_results` are forwarded to the endpoint and that all client…
12	`src/inference_endpoint/dataset_manager/multi_turn_dataset.py:142`	performance	`_validate_conversation_structure`, `_validate_turn_numbering`, and `_build_metadata` each call `self.dataframe.groupby("conversation_id")`…
13	`src/inference_endpoint/load_generator/conversation_manager.py:45`	concurrency	`turn_done: asyncio.Event = field(default_factory=asyncio.Event)` is constructed at `ConversationState` instantiation time, before…
14	`src/inference_endpoint/load_generator/multi_turn_strategy.py:95`	design	`self._inflight: dict[str, str] = {}` is mutated from two execution contexts (the per-conversation pipeline tasks at line 190, and…
15	`src/inference_endpoint/openai/openai_adapter.py:131` (summary-only)	api-contract	`OpenAIAdapter.from_endpoint_response` (the non-msgspec adapter, used when callers explicitly select it) does not extract `tool_calls`,…

…tation Fix 15 review issues across severity levels: - HIGH: metadata=None crash in msgspec adapter, silent exception swallowing in gather - MEDIUM: timeout state consistency, conv_id canonicalization, PromptData fallback, conv_id guard - LOW: enum comparison, frozen config, empty tool_results warning, adapter metadata extraction, groupby deduplication, live-history tool warning, asyncio.Event docs, test TODO Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

tianmu-li · 2026-04-29T14:39:01Z

Hi @arekay-nv, I've addressed the comments. Appreciate it if you could take another look.

Copilot

Pull request overview

This PR updates the benchmarking system to support multi-turn conversational workloads (including tool-calling sequences), adds a dedicated multi-turn dataset format + conversion/validation tooling, and wires a new multi-turn load strategy into the benchmarking session and OpenAI adapters.

Changes:

Add MultiTurnDataset (flat-row JSONL format) with validation, metadata precomputation, and adapter-default handling for per-turn parameters/tools.
Add MultiTurnStrategy + ConversationManager to enforce per-conversation turn sequencing with optional global concurrency limiting, and integrate it into BenchmarkSession.
Extend OpenAI request/response handling for messages, tools, tool-call metadata, and streaming tool-call accumulation; add extensive unit/integration tests and multi-turn docs/examples/scripts.

Reviewed changes

Copilot reviewed 43 out of 44 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/unit/openai/test_openai_adapter.py	New unit tests for OpenAIAdapter tool serialization and tools forwarding.
tests/unit/openai/test_msgspec_adapter.py	New unit tests for msgspec OpenAI adapter tool-call fields and message dict conversion.
tests/unit/load_generator/test_multi_turn_strategy.py	New unit tests for turn sequencing, concurrency semaphore behavior, and metadata propagation.
tests/unit/load_generator/test_multi_turn_conversation_manager.py	New unit tests for conversation state bookkeeping and event gating.
tests/unit/dataset_manager/test_transforms.py	Add coverage for the new `AddDefaultColumns` transform.
tests/unit/dataset_manager/test_multi_turn_dataset.py	Comprehensive tests for `MultiTurnDataset`, including tool sequences and metadata correctness.
tests/unit/core/test_types.py	Add tests for `QueryResult.with_metadata()` and `Query.metadata` round-tripping.
tests/unit/config/test_schema.py	Add tests for multi-turn config validation and multi-turn sample counting logic.
tests/integration/test_multi_turn.py	End-to-end integration tests exercising dataset-history/live-history modes and tool-use conversations.
src/inference_endpoint/openai/types.py	Extend msgspec OpenAI types to include `tool_calls`, `tool_call_id`, and `tools`.
src/inference_endpoint/openai/openai_msgspec_adapter.py	Support `messages` input, tool-call fields, `tools` forwarding, and richer response metadata.
src/inference_endpoint/openai/openai_adapter.py	Support `messages` input, `tools` forwarding, and return richer response metadata.
src/inference_endpoint/openai/accumulator.py	Accumulate streamed tool_calls + finish_reason into final `QueryResult.metadata`.
src/inference_endpoint/load_generator/strategy.py	Extend `PhaseIssuerProtocol.issue()` to accept an optional `data_override`.
src/inference_endpoint/load_generator/session.py	Allow injecting a per-phase strategy; support data overrides in sample issuance.
src/inference_endpoint/load_generator/multi_turn_strategy.py	New multi-turn strategy implementing per-conversation sequencing + global concurrency limiting.
src/inference_endpoint/load_generator/conversation_manager.py	New synchronous conversation state manager used by multi-turn strategy.
src/inference_endpoint/endpoint_client/worker.py	Propagate `Query.metadata` through requests and merge into results.
src/inference_endpoint/endpoint_client/http.py	Add `query_metadata` field to `InFlightRequest`.
src/inference_endpoint/endpoint_client/adapter_protocol.py	Generalize SSE decoding/parse APIs to return adapter-specific chunk objects.
src/inference_endpoint/dataset_manager/transforms.py	Add `AddDefaultColumns` (fill-missing-only) transform.
src/inference_endpoint/dataset_manager/multi_turn_dataset.py	New multi-turn dataset implementation with tool-sequence handling and metadata building.
src/inference_endpoint/dataset_manager/factory.py	Select `MultiTurnDataset` when dataset config includes `multi_turn`; skip prompt-based transforms for it.
src/inference_endpoint/dataset_manager/init.py	Export `MultiTurnDataset` and `AddDefaultColumns`.
src/inference_endpoint/core/types.py	Add `Query.metadata` and `QueryResult.with_metadata()`.
src/inference_endpoint/config/templates/online_template_full.yaml	Expose `multi_turn` dataset block and `multi_turn` load pattern option in template.
src/inference_endpoint/config/templates/online_template.yaml	Expose `multi_turn` load pattern option in template.
src/inference_endpoint/config/templates/offline_template_full.yaml	Expose `multi_turn` dataset block and load pattern option in template.
src/inference_endpoint/config/templates/concurrency_template_full.yaml	Expose `multi_turn` dataset block and load pattern option in template.
src/inference_endpoint/config/templates/concurrency_template.yaml	Expose `multi_turn` load pattern option in template.
src/inference_endpoint/config/schema.py	Add multi-turn schema objects and cross-validate dataset.multi_turn ↔ load_pattern.type.
src/inference_endpoint/config/runtime_settings.py	Make multi-turn sample count issue all dataset client turns (min-sample-count aware).
src/inference_endpoint/commands/benchmark/execute.py	Instantiate and wire `MultiTurnStrategy` automatically when using `MultiTurnDataset`.
scripts/validate_jsonl_schema.py	New CLI script to validate multi-turn JSONL rows against schema.
scripts/multi_turn_dataset_schema.json	New JSON Schema for multi-turn flat-row JSONL datasets.
scripts/convert_agentic_snapshot.py	New conversion+verification script from snapshot-style agentic datasets to flat-row JSONL.
examples/09_MultiTurn/multi_turn_with_concurrency.yaml	Example config: multi-turn with global concurrency limiting.
examples/09_MultiTurn/multi_turn_benchmark.yaml	Example config: basic multi-turn benchmark.
examples/09_MultiTurn/datasets/.gitkeep	Placeholder for converted example datasets.
examples/09_MultiTurn/customer_support_conversations.jsonl	Example multi-turn dataset.
examples/09_MultiTurn/agentic_workflow_benchmark.yaml	Example config for converted agentic workflow dataset.
examples/09_MultiTurn/agentic_coding_benchmark.yaml	Example config for converted agentic coding dataset.
examples/09_MultiTurn/README.md	Multi-turn feature documentation and agentic conversion guidance.
docs/MULTI_TURN_QUICKSTART.md	Quickstart guide for running multi-turn benchmarks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

arekay-nv

There is some confusion around the issue policy, lets sync today.

…n implementation - openai_adapter: normalize null content to "" instead of literal "None" to avoid polluting conversation history in tool-calling responses - multi_turn_dataset: validate tool_results entries have required tool_call_id and content fields; raise InputValidationError at load time - multi_turn_dataset: remove unused "index" field from samples metadata - multi_turn_strategy: wrap mark_turn_complete/mark_turn_failed in try/except KeyError in on_sample_complete - multi_turn_strategy: clear _inflight at end of execute() with warning if entries remain (transport failure or session abort) - docs: remove prescriptive concurrency sizing guide; replace with definition of what target_concurrency controls - docs: rename "Long Conversations" to "Conversations with Many Turns" - docs: add dataset validation utility reference in Troubleshooting Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Fix refusal field set to literal string "None" instead of "" in openai_adapter.py — made downstream refusal checks incorrectly truthy - Add test_pipeline_error_propagated to verify execute() re-raises worker exceptions instead of swallowing them via gather(return_exceptions=True) - Clarify MultiTurnStrategy docstring and MULTI_TURN_QUICKSTART.md: target_concurrency = simultaneous conversations (not requests); each active conversation has exactly 1 in-flight turn at a time - Remove unjustified "Common Configurations" section from quickstart - Correct misleading "workers = concurrent conversations" tip; clarify client.workers and target_concurrency are independent layers Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ategy Rewrites MultiTurnStrategy to issue subsequent turns synchronously inside on_sample_complete() (zero event-loop delay), removing pre-spawned worker tasks and per-conversation asyncio.Event waiting. ConversationState no longer holds an asyncio.Event; sequencing is driven entirely by the strategy. Addresses PR mlcommons#285 reviewer request to move turn issuance into the sample-complete handler. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 43 out of 44 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

src/inference_endpoint/endpoint_client/adapter_protocol.py:134

parse_sse_chunk now appends whatever decode_sse_message returns, including None (e.g., when an SSE message has no choices). This contradicts the docstring (“Silently ignores non-content SSE messages”) and forces downstream accumulators to defensively handle None. Consider filtering out None return values here (and/or handling exceptions per JSON doc) so call sites only see meaningful chunk objects.

    def parse_sse_chunk(cls, buffer: bytes, end_pos: int) -> list[Any]:
        """
        Parse SSE chunk and extract all chunk objects.

        Extracts JSON documents from SSE stream and decodes them to chunk objects.
        Silently ignores non-content SSE messages (role, finish_reason, etc).

        Args:
            buffer: Byte buffer containing SSE data
            end_pos: End position in buffer to parse up to

        Returns:
            List of chunk objects extracted from the SSE chunk
        """
        json_docs = cls.SSE_DATA_PATTERN.findall(buffer[:end_pos])
        parsed_contents = []

        try:
            for json_doc in json_docs:
                content = cls.decode_sse_message(json_doc)
                parsed_contents.append(content)
        except Exception:
            # Normal for non-content SSE messages (role, finish_reason, etc)
            pass

        return parsed_contents

src/inference_endpoint/core/types.py:242

The Query docstring’s gc=False note only mentions data/headers, but Query now also has a mutable metadata dict. To avoid future misuse (and to match the more explicit QueryResult guidance), consider updating this note to include metadata as well and/or adding an AT-RISK (gc=False) warning that data/metadata/headers must not be mutated to introduce cycles.

    Attributes:
        id: Unique identifier for this query (auto-generated UUID).
        data: Request payload as a dictionary (typically contains prompt, model, etc.).
        metadata: Internal metadata that round-trips through transport (e.g., conversation_id).
        headers: HTTP headers to include in the request (e.g., authorization).
        created_at: Timestamp when query was created (seconds since epoch).

    Example:
        >>> query = Query(
        ...     data={"prompt": "Hello", "model": "Qwen/Qwen3-8B", "max_tokens": 100},
        ...     headers={"Authorization": "Bearer token123"},
        ... )

    Note:
        gc=False: Safe because data/headers are simple key-value pairs without cycles.
        Do NOT store self-referential or cyclic structures in data/headers fields.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tianmu-li · 2026-05-04T22:55:01Z

Hi @arekay-nv, I've addressed the comments. Main change is refactoring to an event-based model and limiting number of active conversations to concurrency. Appreciate it if you could take another look.

arekay-nv

Almost there - the only issue is the stickiness of the client-turns.

arekay-nv · 2026-05-06T04:25:18Z

@viraatc can you check the perf implications - there might be non-negligible overhead for non-agentic/multi-turn workloads.

…tation - Remove ConversationMode enum (single-member) and mode field from MultiTurnConfig; drop mode: independent from YAML examples and docs - Merge AddDefaultColumns into AddStaticColumns(overwrite=False) - Replace per-call strategy check with construct-time branch in execute.py - Normalize None tool-calling content to "" in openai_adapter.py - Delete unused Query.metadata, QueryResult.with_metadata, and InFlightRequest.query_metadata plumbing - Add role-specific validation in _validate_conversation_structure: tool rows require non-empty tool_results, assistant rows require content or tool_calls - Backfill explicit sample_index into conversation_metadata["samples"]; MultiTurnStrategy reads sample_meta["sample_index"] instead of enumerate - Add AT-RISK gc=False docstring notes to openai/types.py structs with mutable container fields - Rewrite dataset tool_call_ids with model-generated ids in live-history mode; add test_live_history_remaps_tool_call_id integration test - Lift inline imports to top of test_schema.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot

Pull request overview

Copilot reviewed 56 out of 57 changed files in this pull request and generated 15 comments.

Comments suppressed due to low confidence (3)

tests/unit/load_generator/test_multi_turn_strategy.py:110

This monkey-patched issue method replaces FakePhaseIssuer.issue but also lacks the conversation_id and turn keyword parameters that MultiTurnStrategy passes, so this test will fail with TypeError on the first issued turn.
tests/unit/load_generator/test_multi_turn_strategy.py:174
MultiTurnStrategy passes conversation_id and turn into issue(); this TimedIssuer signature does not accept those keywords, so test_turn_ordering_enforced fails before it can validate ordering.
tests/unit/load_generator/test_multi_turn_strategy.py:393
MultiTurnStrategy calls issue with conversation_id and turn keyword arguments. This ErrorIssuer signature rejects those keywords, so the test raises a TypeError instead of the intended simulated pipeline error.

- Revert 595faf4 (conversation_id/turn in EventRecord/PhaseIssuer/session) - Validate user row content and reject assistant(tool_calls)→user transition without intervening tool row in _validate_conversation_structure - Reject malformed tool_calls (present but not a non-empty list) on assistant rows - Guard _expand_tool_results against non-dict entries in tool_results list - Add "tools" to ColumnFilter optional_columns in both OpenAI adapters so single-turn datasets with a tools column are not silently stripped - Replace RuntimeError with logger.warning in _precompute_isl_for_multi_turn so template-incompatible tokenizers fall back instead of aborting the benchmark - Fix schema cross-validation to check only the performance dataset for multi_turn config instead of any dataset - Move seeding loop inside try/finally in MultiTurnStrategy.execute so cleanup runs even if _start_conversation raises - Add inflight/uuid_to_index to FakePhaseIssuer for _handle_timeout coverage - Strengthen test_no_matching_columns to assert unrelated columns are preserved - Update MULTI_TURN_QUICKSTART.md to accurately describe which turns produce sample events and how to correlate events back to conversations Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Match new error messages from the multi_turn schema cross-validation fix that scopes validation to the performance dataset specifically. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot

Pull request overview

Copilot reviewed 54 out of 55 changed files in this pull request and generated 11 comments.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Thread (conversation_id, turn) from the load generator through the event record / SQL writer pipeline so multi-turn runs can group ISSUED, COMPLETE, and ERROR events by conversation in the event log. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Change 0: cast sum_sq to float in _hdr_stat/_exact_stat to prevent msgpack OverflowError for ns-range latencies >= 4.3s. SeriesStat wire schema already accepts int|float. Change 1: stamp conversation_id/turn on RECV_FIRST/RECV_NON_FIRST events using .get() (not .pop()) so the uuid_to_conv_info entry remains available for the terminal QueryResult pop. Change 2: add test_tpot_osl_for_tool_call_complete to TestAsyncTriggers with exact OSL/TPOT value assertions for tool-call streaming responses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 62 out of 63 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (3)

examples/09_MultiTurn/README.md:267

This troubleshooting sequence implies consecutive tool rows are valid (tool -> tool), but MultiTurnDataset's validator rejects tool after tool and the converter merges parallel tool results into a single tool_results row. The documented valid sequence should match the implemented state machine to avoid users producing datasets that fail validation.

For agentic datasets, use the conversion script (`scripts/convert_agentic_snapshot.py`) to
produce a properly sequenced flat-row file. The valid agentic sequence is:

user -> assistant (tool_calls) -> tool -> [tool | assistant (tool_calls)]* -> assistant -> user -> ...

docs/MULTI_TURN_QUICKSTART.md:96

This bullet repeats the old correlation guidance even though sample events now include conversation_id and turn directly. Update it to point users at those event fields instead of requiring a sample_idx_map/conversation_metadata join.

- **Per-turn metrics**: Latency, TTFT, TPOT for each turn
- **Conversation tracking**: events are keyed by `sample_uuid` only; correlate any event back to a conversation by joining through `sample_idx_map.json` and `conversation_metadata["samples"]`

docs/MULTI_TURN_QUICKSTART.md:232

This verification note still tells users to correlate via sample_idx_map, but the event records now carry conversation_id and turn. The example should show/mention those fields so users can validate turn ordering directly from events.jsonl.

```bash
# List all sample UUIDs in the events log
jq -r '.sample_uuid' logs/multi_turn_test/events.jsonl | sort -u
# Should show UUIDs; correlate to conversations via sample_idx_map.json

…, update docs Closes Copilot review comments mlcommons#41, mlcommons#58, mlcommons#81, mlcommons#87, mlcommons#89, mlcommons#90, mlcommons#91 (round 7): - timeout EventRecords: capture uuid_to_conv_info in a single .pop() and pass conversation_id/turn to both ERROR and COMPLETE constructors so timed-out turns correlate like normal completions - live-history guard: add comment explaining why only role:tool rows are rejected; tighten error message wording - MULTI_TURN_QUICKSTART.md: rewrite Events Log section to reflect conversation_id/turn on EventRecords; update validate_jsonl_schema description to clarify row-level vs cross-row validation scope - examples/09_MultiTurn/README.md: update event correlation paragraph; fix agentic role-sequence diagram to [assistant(tool_calls)->tool]*; add note about merged tool_results - offline_template_full.yaml: correct load_pattern comment to "offline only: max_throughput" - scripts/regenerate_templates.py: apply offline-specific override so regeneration preserves the corrected comment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Address remaining dataset-history-scope reviewer concerns from PR mlcommons#285: - MultiTurnDataset: reject null/empty/NaN conversation_id at construction (groupby dropna=False + explicit guard in _validate_conversation_grouping) - MultiTurnDataset: validate per-call tool_calls shape (id, type, function.name, function.arguments) and per-entry tool_results shape (tool_call_id, content) at construction time instead of failing silently at the endpoint - MultiTurnDataset.load(): raise NotImplementedError when adapter= is passed without api_type/model_params (was silently ignored) - MultiTurnStrategy._fill_slot(): wrap in try/except so a transport or dataset error during a timeout/error-triggered slot refill sets _error and _all_done instead of leaving execute() hanging on _all_done.wait() forever - docs: reword stale turn_timeout_s comment and target_concurrency description - tests: add regression coverage for all fixes plus ISL/OSL/TPOT aggregator metrics integration test for multi-turn turns Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ut/error Dropped turns after a timeout or error were silent — absent from events.jsonl, the sample-index map, and the accuracy collector. Introduces PhaseIssuer.register_skipped() (mirrors issue() minus the HTTP send and inflight increment) and wires both drop sites in MultiTurnStrategy through _abort_remaining_turns(), which calls register_skipped() and publishes synthetic failure EventRecords per dropped turn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

transformers' apply_chat_template (used by multi-turn ISL precompute and metrics token rendering) requires jinja2 at runtime; without it ImportError falls through silent text-tokenization fallbacks and quietly degrades ISL/ OSL/TPOT. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 66 out of 68 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (3)

src/inference_endpoint/dataset_manager/multi_turn_dataset.py:334

User content is treated as valid as long as it is non-empty, so non-string values such as numbers or arbitrary objects can pass validation and be forwarded in messages. The adapters only support text or OpenAI multimodal content lists, so validate the content type here instead of only checking for emptiness.

                    content = row.get("content")
                    is_empty_content = (
                        content is None
                        or (isinstance(content, float) and pd.isna(content))
                        or content == ""
                    )
                    if is_empty_content:
                        raise InputValidationError(
                            f"Conversation {conv_id} turn {row['turn']}: "
                            "user rows must have non-empty 'content'"
                        )

src/inference_endpoint/dataset_manager/multi_turn_dataset.py:268

Assistant rows without tool calls are considered valid for any non-empty content value, including numbers or objects. Those values are copied into chat history and can produce invalid requests; validate that assistant content is a supported message content type when it is present.

                    content = row.get("content")
                    is_empty_content = (
                        content is None
                        or (isinstance(content, float) and pd.isna(content))
                        or content == ""
                    )
                    tool_calls = row.get("tool_calls")
                    has_tool_calls = (
                        isinstance(tool_calls, list) and len(tool_calls) > 0
                    )

src/inference_endpoint/dataset_manager/multi_turn_dataset.py:306

Allowing function.arguments to be a dict makes it possible to load a dataset that later sends non-OpenAI-compliant tool call history to the endpoint; chat completion tool_calls[*].function.arguments must be the JSON string form on the wire. Keep any dict conversion limited to tokenizer/template preprocessing and validate dataset messages as strings here.

                            if not isinstance(fn.get("arguments"), str | dict):
                                raise InputValidationError(
                                    f"Conversation {conv_id} turn {row['turn']} "
                                    f"tool_calls[{call_idx}]: "
                                    "'function.arguments' must be a JSON string or dict"
                                )

Parameterize the streaming aggregator test over text and tool-calls-only payloads, and add TTFT to the asserted metric set. Closes the gap where neither unit nor integration tests verified that TTFT fires on a RECV_FIRST whose payload carries only tool_calls. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

tianmu-li · 2026-05-15T22:07:51Z

@arekay-nv @viraatc I've addressed and resolved the comments. Will merge on Monday if no issue.

gemini-code-assist Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread src/inference_endpoint/load_generator/scheduler.py Outdated

Comment thread src/inference_endpoint/load_generator/scheduler.py Outdated

Comment thread src/inference_endpoint/load_generator/conversation_manager.py Outdated

arekay-nv reviewed Apr 20, 2026

View reviewed changes

tianmu-li force-pushed the feat/tool_sequences branch 2 times, most recently from b127845 to 0a7ad37 Compare April 27, 2026 20:09

arekay-nv reviewed Apr 27, 2026

View reviewed changes

tianmu-li marked this pull request as ready for review May 1, 2026 19:16

tianmu-li requested review from a team and Copilot May 1, 2026 19:16

Copilot started reviewing on behalf of tianmu-li May 1, 2026 19:16 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

arekay-nv reviewed May 4, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 4, 2026 22:17

Copilot started reviewing on behalf of tianmu-li May 4, 2026 22:17 View session

Copilot AI reviewed May 4, 2026

View reviewed changes

Comment thread src/inference_endpoint/openai/types.py

Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py

arekay-nv reviewed May 6, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 6, 2026 23:38

Copilot started reviewing on behalf of tianmu-li May 6, 2026 23:39 View session

Copilot started reviewing on behalf of tianmu-li May 14, 2026 21:16 View session

tianmu-li added 2 commits May 14, 2026 21:17

Address perf concerns

1272386

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Merge remote-tracking branch 'origin/main' into feat/tool_sequences

acb7464

Copilot AI reviewed May 14, 2026

View reviewed changes

tianmu-li and others added 2 commits May 14, 2026 22:56

fix: update test_schema.py error message assertions for Fix 5

dd47796

Match new error messages from the multi_turn schema cross-validation fix that scopes validation to the performance dataset specifically. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 14, 2026 23:12

Copilot started reviewing on behalf of tianmu-li May 14, 2026 23:12 View session

Fix ci test failure post merge

15a4108

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot AI reviewed May 14, 2026

View reviewed changes

tianmu-li and others added 4 commits May 15, 2026 02:02

fix: address PR mlcommons#285 round-6 Copilot review comments

2ac66be

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Fix double-firing of timed-out turns

4442364

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot AI review requested due to automatic review settings May 15, 2026 05:52

Copilot started reviewing on behalf of tianmu-li May 15, 2026 05:53 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

tianmu-li and others added 3 commits May 15, 2026 16:18

viraatc approved these changes May 15, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 15, 2026 20:19

Copilot started reviewing on behalf of tianmu-li May 15, 2026 20:20 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

arekay-nv approved these changes May 18, 2026

View reviewed changes

tianmu-li merged commit d9ab48c into mlcommons:main May 18, 2026
7 checks passed

github-actions Bot locked and limited conversation to collaborators May 18, 2026

Uh oh!

Conversation

tianmu-li commented Apr 17, 2026

What does this PR do?

Type of change

Related issues

Testing

Checklist

Uh oh!

github-actions Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arekay-nv commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Review Council — Multi-AI Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arekay-nv commented Apr 27, 2026

Review Council — Multi-AI Code Review

🟠 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

Uh oh!

tianmu-li commented Apr 29, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

github-actions Bot commented Apr 17, 2026 •

edited

Loading

arekay-nv commented Apr 27, 2026 •

edited

Loading